Abstract

The International Agency for Research on Cancer (IARC) reported an
increase in the worldwide cancer rate, which is now known to be a major
impediment to increasing life expectancy. Glioblastoma multiforme (GBM),
also referred to as grade IV astrocytoma, is a fast-growing, aggressive
type of brain tumour that develops in the cerebral hemispheres, mainly in
the frontal and temporal lobes of the brain. According to the National
Brain Tumor Society, GBM accounts for 49.1 percent of all primary
malignant brain tumors. Despite advances in the available treatment
options, overall patient survival has not improved much and still ranges
from 14.6 to 20.5 months. Also, some individuals show adverse drug
reactions due to their genetic composition, a condition called
idiosyncrasy. The proposed work aims to find an effective treatment
strategy for GBM patients on the basis of their clinical and genomic
factors. The work is based on the Genomic Data Commons (GDC), cBioPortal
and Cancer Browser datasets. We develop different patient cohorts based
on the predictive features using the K-means++ algorithm. A test patient
acquires the treatment pattern of its most similar neighbour using
patient similarity analytics. This is a generalized approach that can be
applied to any disease class where personal traits have an impact on
overall survival.


Keywords: Clustering; Patient similarity; Cancer survival


1. Introduction


Brain cancer has consistently been a leading cause of death worldwide,
and the emergence of the COVID-19 pandemic is likely to make cancer care
more difficult and pose new challenges. Brain tumors can be either benign
or malignant. Every cancer type is unique, so early diagnosis can improve
the median survival rate. Glioblastoma Multiforme (GBM) is a highly
malignant primary brain tumor found in adults that typically leads to
just one year of survival post-diagnosis (Krex et al.), (Hanif et al.).
As per the 2022 statistics from the American Brain Tumor Association, GBM
makes up almost half (49.1 percent) of all primary malignant brain
tumors. GBM prevalence is slightly higher in males than in females (Hanif
et al.). Glioblastoma comes in four subtypes: classical, neural,
proneural and mesenchymal. These subtypes differ in their genetic
irregularities and the unique clinical features of each case (Varma and
Jereesh), (W Verhaak et al.). Recognizing the importance of personalized
medicine and the popularity of machine learning techniques in this field,
we develop a treatment strategy for GBM patients that considers their
unique clinical and genomic characteristics. These characteristics may
not necessarily rely on a specific method of treatment. A clustering
method is used to form patient clusters. We use the concept of patient
similarity to recognize individuals who resemble a reference patient and
use the information from comparable patients' records to generate
customised predictions. Comparisons of different clustering methods for
grouping patients have been presented in the literature, and they
highlight the importance of selecting an appropriate clustering technique
based on the nature and characteristics of the data. We also analysed
different feature selection methods to generate the predictive feature
list and recommend the best method based on accuracy.


2. Literature Survey 


Glioblastoma Multiforme is the most aggressive of all gliomas across the
4 grades. Gliomas are a collection of tumors that originate within the
central nervous system. According to Holland et al. (Holland and
Multiforme), these gliomas are not cured by surgery alone because of
their topologically diffuse nature. The standard treatment of GBM has
been the same for many decades: surgical resection, radiation and
chemotherapy. Many treatment approaches, such as gene therapy and
infection with viral vectors to kill tumor cells, have been tested in
animals for gliomas, but they seem to have no therapeutic effect in
humans. Machine Learning (ML) models that predict a treatment option
based on individual characteristics can therefore improve the overall
chances of survival.


Kunal Malhotra et al. (M et al.) developed a treatment plan for patients
with glioblastoma in which a logistic regression model with forward
feature selection was used to extract 10 predictive features. A binary
feature matrix is formed from clinical factors and genomic features, and
a target variable is formed based on the patient survival period.


Kunal Malhotra et al. (Malhotra et al.) redesigned the initial model
(M et al.). Here they used logistic regression and Cox regression models
for prediction. Age, Karnofsky Performance Score (KPS), neo-adjuvant
treatment history, MGMT methylation status, and GABRA1 and TP53 gene
expressions were identified as predominant features. Patients without a
date of diagnosis or pre-treatment history, or with missing values for
drug duration, were excluded (Varma and Jereesh). Greedy forward feature
selection is used to extract predominant factors.

Ladha et al. (Ladha and Deepa) presented an empirical comparison of
forward and backward feature selection methods and their algorithms.
Forward selection starts with no variables and adds them gradually,
whereas backward selection works in the reverse direction, i.e. it starts
with the complete feature set and iteratively eliminates irrelevant
features until the stopping condition is met.

Among the different supervised ML algorithms used with forward and
backward feature selection, the best performance is achieved when a
Support Vector Machine (SVM) is used with Sequential Backward Selection
(SBS) to extract the predictive features. The identified predictive
features included gender, vital status, neoadjuvant treatment history,
MGMT gene methylation status, and EGFR, NEFL, PDGFRA, RELB and TNFRSF1A
gene expressions (Varma and Jereesh).

A comparison of different variants of the K-means clustering algorithm,
such as x-means, global K-means and efficient K-means, over colon and
leukaemia datasets showed that the initial choice of cluster centres
plays a crucial role in determining the quality of clusters (Kumar,
Wasan, and Krishan). The authors found that K-means++ outperforms the
others due to its ability to select better cluster centres.

According to Shirkhorshidi et al. (Shirkhorshidi, Aghabozorgi, and Wah),
similarity measures are the main components of distance-based clustering
algorithms. Commonly used measures are Minkowski distance, Average
distance, Euclidean distance, Chord distance, Manhattan distance, Jaccard
index, Mahalanobis distance, Cosine similarity and Pearson correlation.
The Euclidean distance measure is widely used for numerical data.
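
A few of the measures listed above can be sketched in plain Python as follows. This is an illustrative sketch only: the function names are our own, vectors are assumed to be equal-length numeric sequences, and the Jaccard index is shown for binary vectors.

```python
import math

def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def euclidean(a, b):
    """Euclidean distance is the Minkowski distance with p = 2."""
    return minkowski(a, b, 2)

def manhattan(a, b):
    """Manhattan distance is the Minkowski distance with p = 1."""
    return minkowski(a, b, 1)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard_index(a, b):
    """Jaccard index for binary vectors: |intersection| / |union|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0
```

Euclidean and Manhattan distance are written as special cases of Minkowski distance to make the relationship between the three explicit.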

Panahiazar et al. (Panahiazar et al.) developed a system to recommend
treatment patterns for congestive heart failure (CHF) patients by
considering information such as lab results, age, gender, race, blood
pressure readings, BMI, echocardiogram measurements and 26 co-morbid
conditions collected from Electronic Health Record (EHR) data. They used
patient similarity analytics to predict medication. Patient cohorts were
formed using two techniques. In the first method, they used K-means and
hierarchical clustering algorithms, and in the second method, a
supervised clustering approach was carried out. Finally, Mahalanobis
distance is used to compute patient-cluster similarity.


Chen et al. (Chen, Su, and Chang) developed a model to suggest treatment
for diabetes patients, which followed a case-based reasoning (CBR)
approach along with an ontology. The system used lifestyle-related
information to form a diabetes care ontology. The system focused on the
clinical factors of the patient, and no genetic information was
considered. Since type 2 diabetes is linked with family history along
with environmental factors, genetic information also needs to be taken
into account. A rich CBR database is needed in order to identify similar
patients, but such an ontology is hard to maintain.


A model that predicts treatment plans for GBM using clinical, biomedical
and imaging data has also been created. The model uses the fuzzy C-means
clustering algorithm and a wrapper feature selection method (Ershadi,
Rise, and Niaki). However, the fuzzy C-means algorithm takes longer
computational time compared to other clustering algorithms.


(Ogbuabor and N) compared the DBSCAN and K-means clustering algorithms on
a healthcare dataset and evaluated their performance based on silhouette
score, clustering accuracy, and computational efficiency. They found that
K-means outperformed DBSCAN, with a silhouette score of 0.97. It is
therefore advisable to use K-means or one of its advanced variants to
create patient clusters, with Euclidean distance as the similarity
measure.


3. Methodology 


The primary objective of this work is to identify a personalised
treatment plan for Glioblastoma. The approach used is multidimensional
patient similarity analytics of Glioblastoma patients based on their
clinical and genomic profiles. Figure 1 shows the proposed model
architecture. We collected data on patients with GBM and prepared it for
analysis. We used SBS to identify 8 key factors. Based on their
similarities in clinical and genomic characteristics, we grouped patients
into clusters. A test patient receives treatment based on the treatment
plan of the most similar patient within their cluster. The system
consists of 5 stages: 1) data collection; 2) data standardization and
pre-processing; 3) identification of the predictive clinical and genomic
features; 4) development of patient cohorts based on the predictive
features; 5) patient similarity assessment.


2023, Vol. 05, Issue 05S May


[Figure 1 block labels: Clinical & Genomic Data; Feature selection;
Clustering; Treatment Data; Test Patient; Patient-cluster &
patient-patient similarity assessment; Treatment plan]


FIGURE 1. Proposed Model Architecture


4. Data Collection 


We collected sample data for about 300 patients diagnosed with GBM from
GDC, cBioPortal (Cerami et al.) and Cancer Browser (Cline et al.).
Clinical data consist of demographic information about the patients and
valuable indicators of the patient's condition. Genomic data were
included because genetics plays a pivotal role in drug response; they
comprise copy number variation of genes, mRNA expression levels and MGMT
methylation status. Treatment data consist of the sequence of drugs or
therapies prescribed.


5. Data Standardization and Pre-processing 


We removed samples with less than 50 percent of their data present and
standardized drug names. In some fields, such as additional chemotherapy,
the values 'completed' and 'not applicable/ not known' were replaced with
the binary values 1 and 0 respectively. We converted Beta and M-values to
methylation status (Du et al.). Records missing either the start or the
end date of a drug were removed.
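
The pre-processing steps described in this section could be sketched as below. This is a minimal illustration, not the pipeline used in the study: the record layout and the field names (`additional_chemotherapy`, `drug_start`, `drug_end`) are hypothetical. The Beta-to-M-value relation is the standard logit transform, M = log2(Beta / (1 - Beta)); the thresholds that map these values to a methylation status are not stated in the text, so that step is left out.

```python
import math

def encode_binary(value):
    """Map 'completed' to 1 and anything else (e.g. 'not applicable/ not known') to 0."""
    return 1 if value.strip().lower() == "completed" else 0

def beta_to_m(beta, eps=1e-6):
    """Convert a methylation Beta-value in [0, 1] to an M-value (base-2 logit)."""
    beta = min(max(beta, eps), 1.0 - eps)  # clamp to avoid log of zero
    return math.log2(beta / (1.0 - beta))

def completeness(record, fields):
    """Fraction of the given fields that are present (not None) in a record."""
    filled = sum(1 for f in fields if record.get(f) is not None)
    return filled / len(fields)

def has_drug_dates(record):
    """True when both the (hypothetical) drug start and end dates are present."""
    return record.get("drug_start") is not None and record.get("drug_end") is not None

def preprocess(records, fields):
    """Drop sparse or undated records and binarize the chemotherapy field."""
    kept = []
    for r in records:
        if completeness(r, fields) < 0.5 or not has_drug_dates(r):
            continue
        cleaned = dict(r)
        cleaned["additional_chemotherapy"] = encode_binary(
            r.get("additional_chemotherapy") or ""
        )
        kept.append(cleaned)
    return kept
```

Drug-name standardization is omitted here because it depends on a vocabulary that the text does not specify.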


6. Determine predictive clinical and genomic features


The data included both numeric and categorical data types. After data
cleaning, a binary feature matrix was formed from the patient features.
When SBS was used for feature selection, an accuracy of 78 percent was
obtained (Varma and Jereesh).
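
The greedy procedure behind Sequential Backward Selection can be sketched as follows. This is a generic illustration under stated assumptions: the `score` callable stands in for whatever evaluation is used (in this work, the accuracy of an SVM classifier), and the toy scorer and feature names in the usage lines are hypothetical and exist only to make the sketch runnable.

```python
def sequential_backward_selection(features, score, n_keep):
    """Greedy SBS: start from all features and drop one per round.

    features: list of feature names
    score:    callable taking a tuple of feature names and returning a
              number, where higher is better (e.g. cross-validated accuracy)
    n_keep:   number of features to retain
    """
    selected = list(features)
    while len(selected) > n_keep:
        # Build every subset obtained by removing exactly one feature,
        # then keep the subset that scores best.
        candidates = [[f for f in selected if f != drop] for drop in selected]
        selected = max(candidates, key=lambda subset: score(tuple(subset)))
    return selected


# Hypothetical scorer: rewards two "informative" features and lightly
# penalizes subset size. A real scorer would train and validate a model.
WEIGHTS = {"gender": 0.3, "mgmt_status": 0.5}

def toy_score(subset):
    return sum(WEIGHTS.get(f, 0.0) for f in subset) - 0.01 * len(subset)

kept = sequential_backward_selection(
    ["gender", "kps", "mgmt_status", "noise_1", "noise_2"], toy_score, n_keep=2
)
```

Each round evaluates one candidate subset per remaining feature, so the number of scorer calls grows quadratically with the number of starting features.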


6.1. Develop patient cohorts based on the 
predictive features 


The patients are grouped into different patient cohorts according to the
selected predictive features using the k-means++ clustering algorithm.
k-means++ is an advanced version of k-means with better seeding. To find
the optimal value of k, the Silhouette method was used (Rousseeuw),
(Shahapure and Nicholas). Whenever a new patient arrives, the Euclidean
distance between the new sample and each cluster centroid is measured.
The patient is allocated to the cluster at minimum distance, and the
clusters are recomputed each time.
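
A compact sketch of k-means++ seeding, Lloyd iterations and silhouette-based selection of k is given below. It is an illustration rather than the implementation used in the study: it assumes points are tuples of numbers with at least k distinct values, and all function names are our own.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: later centres are drawn with probability ~ D^2."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        d2 = [min(dist(p, c) ** 2 for c in centres) for p in points]
        centres.append(rng.choices(points, weights=d2, k=1)[0])
    return centres

def assign(points, centres):
    """Index of the nearest centre for each point."""
    return [min(range(len(centres)), key=lambda j: dist(p, centres[j]))
            for p in points]

def kmeans(points, k, rng, iters=100):
    """Lloyd iterations started from k-means++ seeds."""
    centres = kmeans_pp_init(points, k, rng)
    labels = assign(points, centres)
    for _ in range(iters):
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centre if a cluster empties
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
        new_labels = assign(points, centres)
        if new_labels == labels:
            break
        labels = new_labels
    return centres, labels

def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

def best_k(points, k_values, seed=0):
    """Return the k with the highest mean silhouette, plus all scores."""
    scores = {}
    for k in k_values:
        _, labels = kmeans(points, k, random.Random(seed))
        scores[k] = silhouette(points, labels)
    return max(scores, key=scores.get), scores
```

A new patient can then be placed with `assign([new_point], centres)[0]`, which returns the index of the nearest centroid. Because the seeding is random, results can differ between seeds; running several restarts and keeping the best is a common safeguard.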


6.2. Patient Similarity Assessment 


The treatment features were the sequences of drugs/radiation prescribed
to patients. Patient-cluster and patient-patient similarity were
estimated using Euclidean distance. The patient-cluster distance is
measured and the test patient is assigned to the cluster C with minimum
distance. Then the similarity of the test patient to all other patients
in cluster C with survival label 1 (i.e. patients with survival of 10
months or more) is measured using Euclidean distance. The test patient
adopts the treatment pattern of the most similar patient.
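
The two-step similarity assessment can be sketched as follows. The record layout (dictionary keys `features`, `cluster`, `survival`, `treatment`) and the treatment labels are hypothetical placeholders chosen for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend_treatment(test_features, centroids, patients):
    """Return (cluster index, treatment of the most similar surviving patient).

    patients: list of dicts with hypothetical keys
              'features', 'cluster', 'survival' (1 or 0) and 'treatment'.
    """
    # Step 1: patient-cluster similarity, pick the nearest centroid.
    cluster = min(range(len(centroids)),
                  key=lambda c: euclidean(test_features, centroids[c]))
    # Step 2: patient-patient similarity, restricted to patients in that
    # cluster whose survival label is 1.
    candidates = [p for p in patients
                  if p["cluster"] == cluster and p["survival"] == 1]
    if not candidates:
        return cluster, None
    nearest = min(candidates,
                  key=lambda p: euclidean(test_features, p["features"]))
    return cluster, nearest["treatment"]


centroids = [(0.0, 0.0), (10.0, 10.0)]
patients = [
    {"features": (1.0, 1.0), "cluster": 0, "survival": 1, "treatment": "pattern_A"},
    {"features": (0.2, 0.1), "cluster": 0, "survival": 0, "treatment": "pattern_B"},
    {"features": (2.0, 0.0), "cluster": 0, "survival": 1, "treatment": "pattern_C"},
    {"features": (9.0, 9.0), "cluster": 1, "survival": 1, "treatment": "pattern_D"},
]
result = recommend_treatment((0.5, 0.5), centroids, patients)
```

In the usage lines, the patient with `pattern_B` is closest to the test point but is skipped because its survival label is 0.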


7. Results 


We analysed about 235 patient samples diagnosed with GBM. 205 samples
were used for training and 30 samples were used for testing. SBS is used
to select the most predictive features, which were then given as input to
an SVM classifier. Initially, there were 8 clinical and 29 genomic
features, but after SBS was applied only 3 clinical and 5 genomic
features remained as input for the SVM classifier. The outcomes showed a
cross-validation accuracy of about 78 percent with both a 3-fold and a
5-fold approach. Table 1 shows the 8 predictive features and their
biological role.


International Research Journal on Advanced Science Hub (IRJASH)


TABLE 1. Predictive features and their biological role.

Predictive Feature | Role
Gender | Male/Female. Female patients with GBM have a higher cancer-specific survival (CSS) after surgery (Tian et al.).
Vital Status | Living, last follow-up > 365 days, and living, last follow-up < 365 days.
History of neoadjuvant treatment | Yes/No. Patients receiving neoadjuvant treatment were found to have a longer survival rate.
MGMT gene methylation status | LM/M/HM. Methylation in this region leads to the loss of MGMT protein expression, which in turn reduces the capacity to repair DNA damage (Rivera et al.), (Hegi et al.).
EGFR gene expression | A mutation of EGFR called EGFRvIII was observed, which enhances tumor growth, migration, angiogenesis and metastatic spread; its overexpression led to decreased survival (Hatanpaa et al.), (Saadeh, Mahfouz, and I Assi), (Alentorn et al.).
PDGFRA gene expression | PDGFRA abnormalities were associated with the GBM proneural subtype (W Verhaak et al.). PDGFRA overexpression has a negative impact on overall survival rate (OS) and progression-free survival rate (PFS) (Alentorn et al.).
RELB gene expression | Patients with the GBM mesenchymal subtype have increased RELB expression levels, resulting in a shorter OS [28].
TNFRSF1A gene expression | Linked with immune cell infiltration of GBM; its high expression results in low survival in GBM patients (Wang et al.).


Patient samples were divided into different cohorts based on the
predictive features. Frequently occurring treatment patterns within the
minimum distance cluster were extracted from samples with a positive
survival. The test patient acquires the treatment pattern with the
largest frequency within the minimum distance cluster. If all treatment
patterns among the samples with a positive survival within a cluster have
the same frequency, the Euclidean distance between the test patient and
each of the candidate samples is computed, and the test patient acquires
the treatment pattern of the most similar candidate patient. Out of the
30 samples used for testing, 22 samples were correctly predicted. We used
prediction accuracy as the measure of performance, calculated by dividing
the number of correctly predicted instances by the total number of
instances (Table 2).


TABLE 2. Evaluation Results

Samples | Number of Samples
Correctly predicted | 22
Incorrectly predicted | 8
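
The frequency-based selection rule and the accuracy measure described in this section can be sketched as below. The tie check is an assumption on our part: it treats any tie for the top frequency as the case that falls back to the Euclidean nearest-candidate step, and the pattern labels are placeholders.

```python
from collections import Counter

def most_frequent_pattern(candidate_patterns):
    """Return the single most frequent treatment pattern.

    Returns None when the list is empty or when the top frequency is tied,
    signalling that the nearest-candidate (Euclidean) step should decide.
    """
    counts = Counter(candidate_patterns).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def accuracy(n_correct, n_total):
    """Prediction accuracy: correctly predicted instances / all instances."""
    return n_correct / n_total


# Figures reported in this section: 22 of 30 test samples predicted correctly.
reported = accuracy(22, 30)
```

With the reported counts, `reported` is 22/30, which is the 73.33 percent accuracy quoted for the proposed method.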

Comparison with Existing System 

The heart failure therapy recommendation model developed by Panahiazar et
al. (Panahiazar et al.) considered some patient-specific variables as
predictive features, and patient clusters were formed using k-means and
hierarchical clustering methods. Mahalanobis distance is used as the
similarity measure. They obtained an accuracy of 71 percent with the
k-means clustering method and 73 percent with the hierarchical clustering
method. The proposed method used the k-means++ clustering method to form
patient cohorts and Euclidean distance as the similarity measure. We
obtained an accuracy of 73.33 percent. Malhotra et al. (M et al.),
(Malhotra et al.) developed a system to predict treatment plans for GBM
patients, and they used KPS score, gender, age, and mRNA expression
levels of genes such as TP53, PIK3R1, NF1 and EGFR as predictive
features. A recent study conducted by Wang et al. (Wang et al.) in 2022
found that TNFRSF1A gene expression levels in GBM cells are very high and
have an impact on the survival of GBM patients. According to the study
conducted by Zeng et al. [28], patients with the GBM mesenchymal subtype
have increased RELB expression levels, resulting in a shorter OS. We
included the expression levels of TNFRSF1A and RELB in our predictive
feature set so that our system can better predict an optimal treatment
plan. We used k-means++, an advanced variant of K-means, to form patient
clusters, which provides better convergence.


8. Conclusion and Future Scope 


GBM, also referred to as grade IV astrocytoma, is the most aggressive
class of brain tumor; it spreads rapidly, with an average survival of
nearly 10-15 months. A major challenge in treating this fast-growing
cancer is choosing an ideal treatment strategy for patients after the
standard line of treatment. We identified the predominant clinical and
genomic factors using SBS. Patients were divided into different cohorts
using the k-means++ algorithm. With any variant of the k-means algorithm,
finding an optimal value for k is difficult; the best k value is selected
using the Silhouette method. A patient similarity approach is used to
find the patient who is clinically and genomically most similar to the
study patient. We recommend a treatment pattern based on the treatments
adopted by the most similar patient. The proposed approach is generic
and, if a strong data set is available, it can be applied to any disease
area in which clinical and genetic factors affect survival rates. Drug
dosage was excluded from the study due to the lack of sufficient data;
including it may give better results. The accuracy of predictions depends
heavily on the quality and size of the dataset used, and a sufficiently
large dataset can result in better performance and more accurate
predictions. We have limited the input features to clinical and genomic
information. However, data related to tissue analysis, imaging scans and
disease trends, along with information on proteins, may contribute to a
better survival rate.